The data set that I’ll be working with comes from a study by Cortez et al., “Modeling wine preferences by data mining from physicochemical properties”. The data from the study is split into two data sets, one for white wines and one for red wines. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This analysis focuses on the white wine data set: we will see what findings we can gather from the data and whether any of the input variables are correlated with the quality scores that the experts gave the wines.
Now, let’s load the data set and look at its structure. We’ll also check for null values, just in case. If we find any, we can work around them in the upcoming analysis.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
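The loading and null-value check described above could look roughly like the sketch below. The report does not show the original chunk, so the file name in the comment is hypothetical, and the small data frame is only a stand-in built from the first five rows printed in the str() output (a few columns only).

```r
# Hypothetical file name; the report does not show the original call.
# wine <- read.csv("wineQualityWhites.csv")

# Stand-in: first five rows of a few columns, as printed by str() above.
wine <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  pH               = c(3.00, 3.30, 3.26, 3.19, 3.19),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality          = factor(c(6, 6, 6, 6, 6), levels = 3:9)
)

str(wine)                               # column types and first values

na_per_column <- colSums(is.na(wine))   # number of missing values per column
print(na_per_column)

any_missing <- any(is.na(wine))         # TRUE if at least one value is missing
print(any_missing)
```

If `any_missing` came back TRUE on the real data, `na_per_column` would show which variables need attention before any further analysis.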
Some things to note:
There are 4898 different wines in the data set, with 12 variables being measured for each wine.
From the info link at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt, the variables are as follows:
g/dm^3 = grams/Liter
mg/dm^3 = ppm
1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulfates (potassium sulfate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol (% by volume) : the percent alcohol content of the wine
12 - quality: ranking from experts on a scale of 0-10, where 0 is worst and 10 is best
Now let’s look at the descriptive statistics of the data set to get some more information on what we’re dealing with.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality
## 3: 20
## 4: 163
## 5:1457
## 6:2198
## 7: 880
## 8: 175
## 9: 5
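As a quick worked check, the quality counts in the summary above can be turned into a mean score and a share of wines rated 5 or 6. This is only arithmetic on the counts printed in the table; the variable names are mine.

```r
# Counts per quality level, copied from the summary() output above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

scores  <- as.numeric(names(quality_counts))
n_wines <- sum(quality_counts)                    # total number of wines

# Weighted mean of the scores, using the counts as weights.
mean_quality <- sum(scores * quality_counts) / n_wines

# Fraction of wines rated exactly 5 or 6.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_wines

print(n_wines)
print(round(mean_quality, 2))
print(round(share_5_or_6, 3))
```

The counts add up to the 4898 observations reported by str(), the mean works out to about 5.88, and roughly three quarters of the wines are rated 5 or 6.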
Things to note:
On average, wines rank between 5 and 6 on the quality scale, and there are no wines in the data set rated below 3 or above 9.
total.sulfur.dioxide and free.sulfur.dioxide are going to be correlated since total sulfur content will be dependent on free sulfur content
density changes are pretty small (thousandths), so we may need to adjust calculations in further analysis
There is at least one really sweet wine in the data set: the max residual sugar value is 65.8 g/L, and the description of the residual sugar variable states that wines above 45 g/L are considered sweet.
Looks like there are wines with a significant amount of free SO2 (>50 ppm), so it’ll be interesting to look at the ratings for those wines since the descriptions also state that wines with free SO2 >50 ppm are detectable by their taste
Research on volatile acidity levels in wine showed that levels are regulated by the federal government, where you cannot legally have more than 1.1 g/L of acetic acid in white wine. Detection threshold for acetic acid in beverages is about 0.7g/L. Looking at the table above, we can see that there are some wines that are well above detection threshold and some that are at the legal maximum.
## [1] 1
## [1] 868
## [1] 18
We can see that there is only 1 sweet white wine, 868 wines have free SO2 levels greater than 50 ppm, and 18 wines have a volatile acidity level greater than 0.7 g/L. Let’s save these wines in a table for further analysis at a later point and keep them in mind.
I made a data frame above (odd_wines) to keep track of wines that have features that fall outside normal levels which correlate with a detectable change in taste, according to the info provided with the data set. I’m going to make a general analysis first and see what I can find by looking at the data as a whole, and then follow it up by looking at wines with “detectable” features and see if the features that they have correlate with higher or lower quality ratings.
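A minimal sketch of how the three counts and the odd_wines data frame could be produced is shown here. The thresholds come from the text above; the toy values are invented for illustration, and combining the three conditions with "or" is my assumption about how odd_wines was built, since the original chunk is not shown.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  residual.sugar      = c(1.6, 65.8, 8.5, 6.9),
  free.sulfur.dioxide = c(14, 30, 55, 47),
  volatile.acidity    = c(0.30, 0.28, 0.23, 0.85)
)

# Thresholds taken from the variable descriptions and notes above.
is_sweet <- wine$residual.sugar > 45          # g/L
high_so2 <- wine$free.sulfur.dioxide > 50     # ppm
high_va  <- wine$volatile.acidity > 0.7       # g/L

counts <- c(sweet = sum(is_sweet),
            high_so2 = sum(high_so2),
            high_va = sum(high_va))
print(counts)

# Assumption: a wine is "odd" if it crosses at least one threshold.
odd_wines <- wine[is_sweet | high_so2 | high_va, ]
print(odd_wines)
```

On the real data the three counts would be the 1, 868, and 18 printed above, and odd_wines would hold every wine that crosses at least one of the thresholds.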
For now, let’s take a look at the distributions of the data.
Things to note:
Quality is roughly normally distributed, with a mean quality just under 6
There are some outliers at 3 and 9, meaning that a few wines are rated really bad and a few are rated really great
Now let’s look at the distribution of alcohol content amongst the wines.
Things to note:
The distribution of alcohol content is right-skewed, with a mean of about 10.5% by volume
White wines in the data set are on the lower end of ABV compared with the average range of 12.5%-14.5% that Wikipedia gives for wines. This makes sense, since white wines typically have a lower ABV than reds, which the Wikipedia average may not take into account
Residual sugar is under 10.0 g/dm^3 for the majority of the wines in the data set.
Majority of wines have chloride concentrations between 0.025 and 0.05 g/dm^3.
Sulphates content for the wines tends to be within the range of 0.35 to 0.55 g/dm^3 and looks like a normal distribution.
Most wines have a total SO2 content between 100 and 200 mg/dm^3 (ppm).
Most wines have between 0.25 and 0.5 g/dm^3 of citric acid added to them.
Most wines have a density ranging from 0.990 to 0.9975 g/cm^3
Most wines have between 20 and 50 mg/dm^3 of free sulfur dioxide, which is under the detectable limit in wines by taste.
Most wines have between 0.2 and 0.3 g/dm^3 of volatile acidity, which is well under the detectable by taste limit of 0.7 g/dm^3 (g/L).
Fixed acidity is similar in distribution to volatile acidity, but it is on a different scale: fixed acidity concentrations are greater than 1 g/dm^3.
Now one last look at another variable, pH:
Things to note:
pH seems to be normally distributed, where the mean is around a pH of 3.2
White wines are somewhat acidic and are similar to pH in other soft beverages and alcoholic beverages.
Some wines in the data set are below a pH of 3.0, which is approaching the pH of lemon juice and vinegar. These wines most likely have higher acidity levels.
This data set consists of 4898 observations of wines with 12 features being measured:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulfates (potassium sulfate - g / dm^3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10, worst to best respectively)
Going forward, I’d like to see what changes if any are correlated with a higher rating from critics. Based on the above analysis, we can see that there are a couple of factors that may be responsible for a different rating. Because rating is based on a subjective scale of taste, you may be able to see what wines ranked better or worse on average based on factors that may give the wines a different taste, such as residual sugar levels, alcohol by volume, free sulfur dioxide content, or volatile acidity levels.
I think if we also look at some features such as chloride and citric acid content, we can see if taste is affected by adding different levels of chemicals to the wines.
I created a few variables to add to the data set, one being total acidity which is simply the sum of volatile and fixed acidity.
The second variable I created was the bound sulfur dioxide variable, which is the difference between the total sulfur dioxide content and the free sulfur dioxide content. This variable measures the SO2 that is not free, i.e. the portion bound to other compounds in the wine.
The third variable I made is the sum of the additives in g/dm^3, which consist of citric acid, sulfates, and chlorides. This is a feature that will be of interest in the future since these are described to have an effect on flavor and body of wines and may have a correlation with quality ranking.
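The three derived variables described above can be sketched as follows. The name total.acidity matches the cor.test() output later in the report; the other two column names are my own guesses, and the data frame holds only the first five rows printed by str().

```r
# Stand-in data: first five rows of the relevant columns from str() above.
wine <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40)
)

# Total acidity: fixed plus volatile acidity (g/dm^3).
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Bound SO2: total minus free SO2 (mg/dm^3). Column name is hypothetical.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Additives: citric acid + sulphates + chlorides (g/dm^3). Name is hypothetical.
wine$additives <- wine$citric.acid + wine$sulphates + wine$chlorides

print(wine[, c("total.acidity", "bound.sulfur.dioxide", "additives")])
```

For the first wine this gives a total acidity of 7.27 g/dm^3, a bound SO2 of 125 mg/dm^3, and an additive sum of 0.855 g/dm^3.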
There were no unusual distributions in the data observed above, and there were only a few minor changes I had to do to the data to clean it up since this data set was a tidy data set provided by Udacity. I just dropped a column that listed the wines in numerical order, and made the quality variable an ordered factor since it is a metric of rank for the data set.
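The two cleaning steps mentioned above might look like this sketch. The index column name `X` is an assumption (the report never names it), and the quality values are toy values chosen so that more than one level appears.

```r
# Toy data: X stands in for the running row-number column (hypothetical name).
wine <- data.frame(
  X       = 1:5,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 5L, 7L, 6L, 8L)
)

# Drop the column that only lists the wines in numerical order.
wine$X <- NULL

# Make quality an ordered factor so that comparisons such as ">" are meaningful.
wine$quality <- factor(wine$quality, levels = 3:9, ordered = TRUE)

str(wine)
print(wine$quality[5] > wine$quality[2])   # an 8 outranks a 5
```

Treating quality as an ordered factor keeps it categorical for grouping and box plots while preserving the ranking between levels.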
Let’s look at some bivariate plots to see what insights we can gather from the data.
Let’s get a quick summary of the data set by displaying a matrix of plots using ggpairs to see the relationships that the features have with one another.
The column labels 1-12 correspond to the following variables:
1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulfates 11 - alcohol 12 - quality
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
One thing that is interesting about the plot above is that alcohol content seems to be positively correlated with quality: wines with a higher alcohol content are rated a bit higher on the quality scale. Looking at the summary statistics, we can see that the median and mean alcohol content increase with quality level for wines that have a quality ranking of 5 and up. The trend doesn’t hold at the lower end of the quality spectrum, so there might be some other underlying factors that counteract the positive effect that alcohol content seems to have on quality.
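The per-quality summaries above have the layout that by() produces, so a sketch of that kind of call is given here. The data frame contains toy values only, and tapply() is added as a compact way to pull out a single statistic per group.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  alcohol = c(9.2, 9.5, 9.8, 10.5, 10.6, 11.4, 11.2, 12.0),
  quality = factor(c(5, 5, 5, 6, 6, 7, 7, 8))
)

# Full six-number summary of alcohol within each quality level.
print(by(wine$alcohol, wine$quality, summary))

# Just the median per quality level, as a named vector.
median_by_quality <- tapply(wine$alcohol, wine$quality, median)
print(median_by_quality)
```

Comparing consecutive entries of `median_by_quality` is a simple way to check whether the medians rise steadily with quality.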
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
Based on the above visualization, we can see that there doesn’t seem to be a major difference in residual sugar content across different qualities of wines. One would assume that sweetness would affect the quality ranking of a wine since taste is taken into account when ranking the wines. However, there doesn’t seem to be any pattern here that we can use. We’ll revisit residual sugar later in our multivariate analysis to see if there might be a link to another feature that can be missed here.
Now let’s look at free sulfur dioxide.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.25 33.50 53.33 47.50 289.00
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 18.00 23.36 30.50 138.50
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 22.00 35.00 36.43 50.00 131.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 24.00 34.00 35.65 46.00 112.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.13 41.00 108.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 28.00 35.00 36.72 44.50 105.00
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 27.0 28.0 33.4 31.0 57.0
Looking at the above plot and summary statistics, there doesn’t seem to be a big difference in free SO2 content between wines of different quality. As we stated above, SO2 above 50 ppm is detectable in wines and could affect the taste. However if we look at the median SO2 content for each quality of wine, we can see that none of them are over that threshold. There are some outliers that are well above 50ppm, however there doesn’t seem to be a significant effect on the overall quality of the wines. Perhaps free SO2 content doesn’t affect taste as much as anticipated based on the information provided with the data set. We’ll explore this as well in the future multivariate analysis portion.
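Medians can hide how many wines in each group actually cross the 50 ppm line, so one possible follow-up is to compute the share of wines above the threshold per quality level. The sketch below uses toy values, and the approach is my suggestion rather than something done in the report.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  free.sulfur.dioxide = c(289, 13, 35, 52, 34, 46, 33, 28),
  quality             = factor(c(3, 3, 5, 5, 6, 6, 7, 9))
)

# TRUE where free SO2 exceeds the 50 ppm detection threshold from the text.
over_threshold <- wine$free.sulfur.dioxide > 50

# The mean of a logical vector is the proportion of TRUE values.
share_over <- tapply(over_threshold, wine$quality, mean)
count_over <- tapply(over_threshold, wine$quality, sum)

print(share_over)
print(count_over)
```

A share that stays flat across quality levels would support the reading that crossing 50 ppm has little effect on the rating.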
Now let’s look at pH and whether the acidity of the wine affects the quality ranking.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
Based on the above graph, there does seem to be a very slight difference in pH across quality. Following the summary statistics, the mean and median pH levels increase with quality ranking from a quality of 5 upward. In other words, within that range, less acidic wines tend to be ranked higher. One thing to note is that pH depends on several factors that the data set provides, such as total acidity (fixed/volatile acidity), citric acid content, alcohol content, and sulfur dioxide content. This will be an interesting feature to look at in our multivariate analysis because of this dependence on a wine’s composition.
Let’s look at the total acidity feature that we created earlier and see how it’s related to quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.415 6.820 7.705 7.933 8.857 12.030
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.450 6.745 7.310 7.511 7.920 10.910
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.690 6.660 7.140 7.236 7.730 10.550
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.110 6.550 7.030 7.098 7.567 14.470
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.370 6.505 6.980 6.997 7.460 9.450
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.125 6.475 7.040 6.935 7.490 8.570
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.960 7.260 7.360 7.718 7.640 9.370
According to the plot, there doesn’t seem to be a major trend between total acidity and quality. There does appear to be more variability in the total acidity content of wines with a quality ranking of 3 than in other wines. Regardless, this doesn’t seem to be a very useful metric for now and we’ll have to explore it later.
Now let’s look at additives, which is the sum of citric acid, chlorides, and sulfates.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6370 0.7812 0.8235 0.8648 0.9325 1.4440
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3740 0.6670 0.8170 0.8305 0.9835 1.4420
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3440 0.7420 0.8600 0.8714 0.9830 1.6540
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.7540 0.8550 0.8743 0.9630 2.2320
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3870 0.7538 0.8440 0.8669 0.9555 1.4170
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4060 0.7370 0.8160 0.8511 0.9405 1.3910
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7180 0.8710 0.9210 0.8794 0.9420 0.9450
There doesn’t seem to be a trend between the different quality wines based on the total amount of additives in the wine. The medians all fall between roughly 0.8 and 0.9 g/dm^3 of total additives, and the means are all close to one another. Perhaps the sum of the additives isn’t as impactful as we would’ve thought.
One last feature that I wanted to look at was bound sulfur dioxide content and how it differs across quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 82.5 106.0 117.3 152.2 331.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 67.5 102.0 101.9 133.8 195.0
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 91.0 114.0 114.5 137.0 293.5
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 76.0 97.0 101.4 123.0 243.0
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 71.00 86.00 90.99 106.00 199.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.00 71.00 84.00 89.45 104.50 159.50
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 62.0 82.0 82.6 96.0 112.0
There seems to be a slight trend with bound sulfur dioxide and quality, where mean and median bound sulfur dioxide content decrease as quality increases, most clearly from a quality of 5 upward.
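To compare several features side by side instead of reading one by() printout at a time, the group medians can be collected into a single table. This is a sketch with toy values; the column name bound.sulfur.dioxide is hypothetical, and aggregate() is my choice here, not necessarily what the report used.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  alcohol              = c(9.4, 9.6, 10.4, 10.6, 11.3, 11.5),
  pH                   = c(3.15, 3.17, 3.17, 3.19, 3.19, 3.21),
  bound.sulfur.dioxide = c(110, 118, 95, 99, 84, 88),
  quality              = factor(c(5, 5, 6, 6, 7, 7))
)

# One row per quality level, one median column per feature.
medians <- aggregate(cbind(alcohol, pH, bound.sulfur.dioxide) ~ quality,
                     data = wine, FUN = median)
print(medians)
```

Reading down each column of `medians` makes it easy to see which features move with quality and in which direction.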
Out of pure interest, let’s look at some scatter plots to see if our data set is consistent with real-world chemistry concepts. The first features that I’ll be looking at are density and its relationship with alcohol content. The density of a fluid is its mass divided by its volume. The density of water is typically used as a relative standard, where it is 1 g/cm^3. According to Wikipedia, the density of ethanol is 0.7893 g/cm^3, which means that as you add more ethanol to water, the overall density should decrease. This should be reflected in our data: as the alcohol by volume content of a wine increases, its density should decrease from 1 g/cm^3. Let’s see if this trend is present in our data:
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
As we can see above, you do see a negative correlation between alcohol content and density of a wine. Again, this is consistent with principles of physical chemistry, where an overall decrease in density should be observed with an increase in alcohol content.
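The test output above comes from cor.test(). A minimal sketch of the call is shown here on the first five alcohol and density values printed by str(), followed by a rough volume-weighted density estimate. That estimate is an idealization of mine: it ignores sugar, other dissolved solids, and the volume contraction that occurs when ethanol and water mix.

```r
# First five rows of alcohol and density, as printed by str() earlier.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Pearson correlation test (the default method of cor.test).
result <- cor.test(wine$alcohol, wine$density)
r  <- unname(result$estimate)   # sample correlation coefficient
ci <- result$conf.int           # 95 percent confidence interval
print(result)

# Idealized estimate: volume-weighted average of ethanol and water densities.
abv           <- 0.10                                 # 10% alcohol by volume
ideal_density <- abv * 0.7893 + (1 - abv) * 1.0       # g/cm^3
print(ideal_density)
```

The idealized value is lower than the densities in the data set, which fits the idea that sugar and other dissolved components push density back up.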
Let’s look at one more set of features and see if they’re consistent with real-world science. In chemistry, pH is a measure of the acidity or basicity of a solution, measured on a scale of 0-14, where 0 is strongly acidic, 7 is neutral (pure water), and 14 is strongly basic. Because of this, an increase in acidic content should cause a decrease in the pH of a fluid. In the context of our data, an increase in total acidity should be correlated with a decrease in the pH of a wine. Let’s see if this property of physical chemistry is consistent with our data.
##
## Pearson's product-moment correlation
##
## data: wine$total.acidity and wine$pH
## t = -33.116, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4503932 -0.4046240
## sample estimates:
## cor
## -0.4277827
As we can see above, there is a negative trend in the data, where a decrease in total acidity is correlated with an increase in pH. This is consistent with the principles of physical chemistry, although the correlation reported by cor.test() (about -0.43) isn’t as strong as one might expect. pH can also be affected by other factors, such as alcohol content and sulfur dioxide, which may be why the correlation is weaker. This will be explored later in the multivariate analysis portion.
Let’s look at some other scatter plots that focus on the relationships of some features of interest with either density or alcohol content to see if there are any other findings that coincide with science.
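One further point that may be worth keeping in mind: pH is defined on a logarithmic scale (pH = -log10 of the hydrogen ion concentration), so even a perfectly monotone relationship with concentration is not a straight line, and Pearson's r measures only linear association. The sketch below illustrates this with an idealized series of concentrations. It is not an analysis of the wine data, and wine acids are weak acids, so the real relationship is more complicated.

```r
# Idealized pH values spanning the range seen in the data set.
pH <- seq(2.7, 3.8, by = 0.1)

# Hydrogen ion concentration implied by each pH (mol/L), by definition of pH.
h_conc <- 10^(-pH)

# Pearson measures linear association; Spearman measures monotone association.
pearson  <- cor(h_conc, pH)
spearman <- cor(h_conc, pH, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

The rank-based Spearman coefficient is exactly -1 here because the relationship is strictly decreasing, while the Pearson coefficient falls short of -1 because the curve is not linear.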
##
## Pearson's product-moment correlation
##
## data: wine$total.acidity and wine$alcohol
## t = -7.9076, df = 4896, p-value = 3.218e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13986371 -0.08455661
## sample estimates:
## cor
## -0.1122971
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$alcohol
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4709775 -0.4262443
## sample estimates:
## cor
## -0.4488921
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$total.acidity
## t = 19.417, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2411894 0.2932005
## sample estimates:
## cor
## 0.2673897
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$total.sulfur.dioxide
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
A lot to cover here, but let’s start with alcohol content. First off, we can see that there really isn’t much correlation with total acidity content. However, there seems to be a negative correlation between alcohol content and both total sulfur dioxide and residual sugar content. The correlation between alcohol content and total SO2 is -0.449, and the correlation between alcohol and residual sugar is -0.451. Knowing how fermentation works, it makes sense that residual sugar should decrease with an increase in alcohol content, since sugars in wine are converted to alcohol by yeast during the fermentation process. Therefore the less residual sugar left in the wine, the more alcohol should be present after fermentation. The relationship between total SO2 and alcohol is a bit trickier, since information on their relationship isn’t as readily available. Regardless, it is something to look at in the future when thinking about how a wine’s alcohol content and its relationship to other features go into the quality ranking of a wine.
Now let’s discuss the relationship that density has with the same variables that we discussed above. Looking at the plots, there is only a weak positive correlation with total acidity. We can see a much stronger relationship when looking at total sulfur dioxide and residual sugar. Because total sulfur dioxide includes free SO2 gas (0.0026 g/cm^3) and its salt form bisulfite (1.48 g/cm^3), the addition of these components to a wine seems to result in an overall increase in density. This is also true of sugar (1.50 - 1.65 g/cm^3), assuming that adding a denser component to wine results in an increase in overall density.
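The six correlations discussed above can also be collected into one small table instead of six separate cor.test() printouts. The sketch below uses only the first five rows printed by str(), so its numbers will not match the report, and the helper name `cor_with` is mine.

```r
# First five rows of the relevant columns, as printed by str() earlier.
wine <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9)
)
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

features <- c("total.acidity", "total.sulfur.dioxide", "residual.sugar")

# Pearson correlation of every feature with one target column.
cor_with <- function(target) {
  sapply(features, function(f) cor(wine[[f]], wine[[target]]))
}

cor_table <- data.frame(alcohol = cor_with("alcohol"),
                        density = cor_with("density"))
print(round(cor_table, 3))
```

Each row of `cor_table` is a feature and each column a target, which makes the sign flip between the alcohol and density columns easy to spot.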
As we mentioned before, we singled out some outlier wines and put them in another data frame, odd_wines. Let’s look at some of the plots that we explored before when looking at the entirety of the data.
Getting a closer look at some of the more extreme wines of the data set helps us gather more insight into what makes a quality wine. The box plots are very similar to the earlier ones that were based on the data set as a whole. One thing that we can gather is that there is a lot of variance in wines with a quality of 3 with respect to the features that we looked at above. They tend to be more basic, tend to have more free SO2, and have a low residual sugar count. These factors could explain why those wines rank so low on taste, since these should be detectable when drinking the wine. As far as wines on the higher end, there’s not too much to see other than what we saw before, where median alcohol content increases with quality. What’s interesting to note is that although these wines have one or more features that are high enough in quantity that they should be detectable and affect taste, they mirror the same overall trend in quality seen when looking at all the wines.
Some things to look at moving forward:
Density seems to be simply a function of multiple components in a wine and might not be an important factor to look at in further analysis. From the data, it seems like it’s more of an aggregate of several features and isn’t necessarily suited to answering our question of which features impact a wine’s quality rating the most.
Alcohol content is definitely something we want to look at moving forward when analyzing what makes a wine have a better quality ranking, because of its somewhat strong correlation with quality.
Factors associated with alcohol content should be looked at as well, which include residual sugar and total SO2 content. I’ll also be looking at the additive content of the wines to see if there’s any link there.
Some features don’t seem to be linked to quality at all on the surface. We’ll continue to explore this when doing multivariate analysis.
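The report judges the alcohol and quality relationship from box plots and group summaries. If a single number is wanted, one option is a rank correlation, since quality is an ordinal score. The sketch below uses toy values, and the choice of Spearman's method is my assumption, not something the report states.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  alcohol = c(9.2, 9.5, 10.5, 10.6, 11.4, 12.0),
  quality = factor(c(5, 5, 6, 6, 7, 8), levels = 3:9, ordered = TRUE)
)

# Convert factor labels back to their numeric scores (not the level indices).
quality_score <- as.numeric(as.character(wine$quality))

# Spearman's rho works on ranks, which suits an ordinal rating.
rho <- cor(wine$alcohol, quality_score, method = "spearman")
print(rho)
```

Going through `as.character()` first matters: calling `as.numeric()` directly on the factor would return level positions (1 to 7) rather than the scores 3 to 9.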
Now let’s do some multivariate analysis to help us see if some of the dead ends we saw before are really dead ends. First off, let’s look at alcohol’s relationship with other features with respect to quality rating.
Now let’s look at the ratio of acidity to alcohol by volume and also the ratio of total SO2 to alcohol with respect to pH and see what we find.
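The two ratios can be computed as plain column divisions. The sketch below uses the first five rows printed by str(); the ratio column names are hypothetical.

```r
# First five rows of the relevant columns, as printed by str() earlier.
wine <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19)
)
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Ratio of total acidity (g/dm^3) to alcohol (% by volume).
wine$acidity.alcohol.ratio <- wine$total.acidity / wine$alcohol

# Ratio of total SO2 (mg/dm^3) to alcohol (% by volume).
wine$so2.alcohol.ratio <- wine$total.sulfur.dioxide / wine$alcohol

print(round(wine[, c("acidity.alcohol.ratio", "so2.alcohol.ratio", "pH")], 3))
```

For the first wine the acidity ratio is 7.27 / 8.8, about 0.83, and the SO2 ratio is 170 / 8.8, about 19.3.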
One last set of plots, which will look at the relationship between additives and alcohol, colored by quality ranking. The three plots will look at the entirety of the data set, a subset of the worst rated wines (3,4), and a subset of the best rated wines (8,9).
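The two subsets mentioned above could be built as in this sketch. The values are toy values, and the names `worst` and `best` are mine.

```r
# Toy values for illustration only (not rows from the real data set).
wine <- data.frame(
  alcohol   = c(10.4, 10.1, 9.5, 10.5, 11.4, 12.0, 12.5),
  additives = c(0.82, 0.82, 0.86, 0.86, 0.84, 0.82, 0.92),
  quality   = factor(c(3, 4, 5, 6, 7, 8, 9), levels = 3:9, ordered = TRUE)
)

# Convert factor labels to numeric scores before comparing against 3, 4, 8, 9.
quality_score <- as.numeric(as.character(wine$quality))

worst <- wine[quality_score %in% c(3, 4), ]   # lowest rated wines
best  <- wine[quality_score %in% c(8, 9), ]   # highest rated wines

print(worst)
print(best)
```

Each subset can then be passed to the same plotting code as the full data set, so the three plots differ only in which rows they show.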
Things to note:
As seen before, residual sugar decreases as alcohol increases but stays near 6 g/dm^3 for higher rated wines.
Free sulfur dioxide in a wine may be more important than it first appeared: higher rated wines are closer to 50 ppm but don’t necessarily cross it. As noted at the beginning, free SO2 helps preserve the wine but becomes detectable past 50 ppm. Staying in a range of 25 - 45 ppm free SO2 looks like a trend amongst better wines.
Volatile acidity has a weaker relationship, although you can see that too much volatile acidity and not enough alcohol are correlated with a low ranking on the quality scale.
Additives don’t appear to have a big impact on wine quality rating. However, it is interesting that higher rated wines appear to have a total additive concentration ranging between 0.8 and 0.9 g/dm^3.
Best tasting wines fall between a pH of 3.0 and 3.4, and it doesn’t seem like pH is significantly affected by the features above.
Ratio of total acidity to alcohol by volume seems to be best when it’s under 0.7, where anything above is associated with a lower quality.
Ratio of total SO2 to alcohol seems to be best when it is around 10-12, where any deviation from that seems to be associated with lower quality.
The sum of additives in a wine doesn’t seem to have a major impact on its overall quality with respect to alcohol. The additive content is pretty similar across the board for all quality levels. For the most part, wines with a total additive concentration under 1 g/dm^3 appear more likely to rank higher.
First off, I wanted to show the distribution of quality amongst the dataset. Quality seems to have a roughly normal distribution, where we can see that the majority of the wines are between 5 and 6 and that there are no wines that are absolutely terrible (0 to 2) or wines that are a perfect 10. In fact, there are very few wines that rank above 8 or below 5, so one would assume that you have to do something extraordinary to land on the extreme ends of the ranking spectrum. This led to the question of what features are most strongly associated with quality. Further analysis into the features was done to show the relationship between them and quality.
When exploring the data, I found that alcohol content was the strongest indicator of wine quality, where the median alcohol percentage increased with an increase in wine quality ranking. You can also see the mean alcohol content increase over quality (as denoted by the red x). Although the median alcohol content for wines with a 3 or 4 ranking is similar to that of wines with a 6 ranking, the general trend is that an increase in ABV is correlated with an increase in quality. The exceptions may be explained by other factors that affect taste, like sugar content and other additives used to preserve flavor or add a certain property to the wine.
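The per-quality medians and means behind a box plot like this can be computed with `aggregate()`. The sketch below is illustrative only: it uses the first ten rows shown in the `str()` output at the top of this report, which are all rated 6, so it returns a single group. On the full data set the same call would return one row per quality level.

```r
# First ten rows of the data set, copied from the str() output at the top
# of this report. Quality is a factor with levels "3" through "9".
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality = factor(rep("6", 10), levels = as.character(3:9))
)

# Median and mean alcohol content for each quality level present in the data.
by_quality <- aggregate(
  alcohol ~ quality,
  data = wines,
  FUN  = function(x) c(median = median(x), mean = mean(x))
)

print(by_quality)
```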
Finally, the last series of plots shows the relationship alcohol has with different features that are associated with changes in the taste of a wine. We can see above that the extreme ends of the wine quality scale show some trends that may be the keys to making a great wine or avoiding a terrible one. First off, we can see that residual sugar decreases as alcohol increases. However, we now see that higher rated wines (8,9) fall within a residual sugar range of about 4 - 7 g/dm^3, while low rated wines (3,4) tend to be below 5 g/dm^3 and have an ABV of less than 11%.
We can also see a relationship with free sulfur dioxide, where higher rated wines are closer to 50 ppm. As previously discussed, when free SO2 crosses that threshold of 50 ppm, it leaves a detectable quality that can affect the taste of the wine. Although it serves a purpose in wine production, too much or too little seems to correspond with a decrease in quality.
Volatile acidity seems to matter mostly for wines with lower alcohol content, since wines with low alcohol content and average volatile acidity levels rank low on the quality scale. If you look along the rest of the alcohol scale, you’ll see that volatile acidity levels don’t really change much and don’t seem to impact the rating.
Additives, which include citric acid, chlorides, and sulfates, don’t appear to have a big impact on wine quality rating. These factors are related to different tastes since they can affect chemical properties of the wine such as pH or density, so it’s interesting that an aggregate metric of these factors doesn’t really show a trend. When looking at them individually, there wasn’t much of a trend there either, so perhaps this is simply a dead end. One thing to note is that higher rated wines appear to have a total additive concentration ranging between 0.8 and 0.9 g/dm^3, so perhaps that’s something to take into account during wine production.
There are a couple of things to reflect on after this lengthy analysis. According to our findings, there aren’t a whole lot of qualities that affect how a wine is perceived to taste on their own. Rather, it’s a combination of factors that, when added in the right amounts, can lead to a higher quality ranking in terms of taste. I found that the alcohol content of a wine, in combination with concentration ranges of certain additives and preservatives, is more likely to be an indicator of whether or not a wine will taste great.
Although bad and good wines do seem to follow some trends, they don’t necessarily fit a linear regression model, since their features don’t really have much of a linear relationship with one another. Instead, a modeling technique such as random forest classification or K-nearest neighbors would likely be a better choice for further studies and for creating a model to predict where a wine will fall in the ranking.
Although we explored a lot of the data, there is still much more to look at. However, it may be more useful to analyze different features of a wine than ones such as density. Other variables worth looking at include dates of production, location of production, and information on the critics, and it would also help to include more data from wines with poor and high quality rankings.
Ultimately, trying to predict people’s perception of wine is tricky. As we saw before, there is no one clear defining variable that makes or breaks a wine. This can be attributed to the fact that it’s difficult to quantify something so subjective as taste, where one person’s taste can differ so much from another person’s. This is a good start toward seeing if you can predict what makes a wine good in terms of taste, but I believe that you’ll always have a problem trying to make assumptions from such a subjective variable.